Abstract — In the classic Bayesian restless multi-armed bandit 
(RMAB) problem, there are N arms, with rewards on all arms 
evolving at each time as Markov chains with known parameters. 
A player seeks to activate K ≥ 1 arms at each time in order 
to maximize the expected total reward obtained over multiple 
plays. RMAB is a challenging problem that is known to be 
PSPACE-hard in general. We consider in this work the even 
harder non-Bayesian RMAB, in which the parameters of the 
Markov chain are assumed to be unknown a priori. We develop 
an original approach to this problem that is applicable when 
the corresponding Bayesian problem has the structure that, 
depending on the known parameter values, the optimal solution 
is one of a prescribed finite set of policies. In such settings, we 
propose to learn the optimal policy for the non-Bayesian RMAB 
by employing a suitable meta-policy which treats each policy from 
this finite set as an arm in a different non-Bayesian multi-armed 
bandit problem for which a single-arm selection policy is optimal. 
We demonstrate this approach by developing a novel sensing 
policy for opportunistic spectrum access over unknown dynamic 
channels. We prove that our policy achieves near-logarithmic 
regret (the difference in expected reward compared to a model- 
aware genie), which leads to the same average reward that can 
be achieved by the optimal policy under a known model. This is 
the first such result in the literature for a non-Bayesian RMAB. 
For our proof, we also develop a novel generalization of the 
Chernoff-Hoeffding bound. 

Index Terms — restless bandit, regret, opportunistic spectrum 
access, learning, non-Bayesian 



I. Introduction 

MULTI-ARMED bandit (MAB) problems are fundamen- 
tal tools for optimal decision making in dynamic, un- 
certain environments. In a multi-armed bandit problem, there 
are N independent arms each generating stochastic rewards, 
and a player seeks a policy to activate K ≥ 1 arms at each 
time in order to maximize the expected total reward obtained 
over multiple plays. A particularly challenging variant of 
these problems is the restless multi-armed bandit problem 
(RMAB) [2], in which the rewards on all arms (whether or 
not they are played) evolve at each time as Markov chains. 

This research was sponsored in part by the U.S. Army Research Laboratory 
under the Network Science Collaborative Technology Alliance, Agreement 
Number W911NF-09-2-0053. The work of Q. Zhao was supported by the 
Army Research Office under Grant W911NF-08-1-0467. This is an extended, 
full version of a paper that appeared in ICASSP 2011 [1]. 

MAB problems can be broadly classified as Bayesian or 
non-Bayesian. In a Bayesian MAB, there is a prior distribution 
on the arm rewards that is updated based on observations at 
each step and a known-parameter model for the evolution of 
the rewards. In a non-Bayesian MAB, a probabilistic belief 
update is not possible because there is no prior distribution 
and/or the parameters of the underlying probabilistic model 
are unknown. In the case of non-Bayesian MAB problems, the 
objective is to design an arm selection policy that minimizes 
regret, defined as the gap between the expected reward that can 
be achieved by a genie that knows the parameters, and that 
obtained by the given policy. It is desirable to have a regret 
that grows as slowly as possible over time. In particular, if 
the regret is sub-linear, the average regret per slot tends to 
zero over time, and the policy achieves the maximum average 
reward that can be achieved under a known model. 

Even in the Bayesian case, where the parameters of the 
Markov chains are known, the restless multi-armed bandit 
problem is difficult to solve, and has been proved to be 
PSPACE-hard in general [3]. One approach to this problem has 
been Whittle's index, which is asymptotically optimal under 
certain regimes [4]; however, it does not always exist, and 
even when it does, it is not easy to compute. It is only in 
very recent work that non-trivial tractable classes of RMAB 
where Whittle's index exists and is computable have been 
identified [5], [6]. 

We consider in this work the even harder non-Bayesian 
RMAB, in which the parameters of the Markov chain are 
assumed to be unknown a priori. Our main contribution in this 
work is a novel approach to this problem that is applicable 
when the corresponding Bayesian RMAB problem has the 
structure that the parameter space can be partitioned into a 
finite number of sets, for each of which there is a single 
optimal policy. For RMABs satisfying this finite partition 
property, our approach is to develop a meta-policy that treats 
these policies as arms in a different non-Bayesian multi-armed 
bandit problem for which a single arm selection policy is 
optimal for the genie, and learn from reward observations 
which policy from this finite set gives the best performance. 

We demonstrate our approach on a practical problem per- 
taining to dynamic spectrum sensing. In this problem, we 
consider a scenario where a secondary user must select one 
of N channels to sense at each time to maximize its expected 
reward from transmission opportunities. If the primary user 
occupancy on each channel is modeled as an identical but 
independent Markov chain with unknown parameters, we 
obtain a non-Bayesian RMAB with the requisite structure. 
We develop an efficient new multi-channel cognitive sensing 
policy for unknown dynamic channels based on the above 
approach. We prove for N = 2, 3 that this policy achieves 
regret (the gap between the expected optimal reward obtained 
by a model-aware genie and that obtained by the given policy) 
that is bounded uniformly over time n by a function that grows 
as $O(G(n) \cdot \log n)$, where $G(n)$ can be any arbitrarily slowly 
diverging non-decreasing sequence. For the general case, this 
policy achieves the average reward of the myopic policy 
which is conjectured, based on extensive numerical studies, 
to be optimal for the corresponding problem with known 
parameters. This is the first non-Bayesian RMAB policy that 
achieves the maximum average reward defined by the optimal 
policy under a known model. 

II. Related Work 

A. Bayesian MAB 

The Bayesian MAB takes a probabilistic viewpoint toward 
the unknown system parameters. By treating the player's a 
posteriori probabilistic knowledge (updated from the a priori 
distribution using past observations) on the unknown parameters 
as the system state, Bellman in 1956 abstracted and 
generalized the Bayesian MAB to a special class of Markov 
decision processes (MDP) [7]. Specifically, there are N independent 
arms with fully observable states. One arm is activated 
at each time, and only the activated arm changes state as per 
a known Markov process and offers a state-dependent reward. 
This general MDP formulation of the problem naturally leads 
to a stochastic dynamic programming solution based on back- 
ward induction. However, such an approach incurs exponential 
complexity with respect to the number of arms. The problem 
of finding a simpler approach remained open until 1972, when 
Gittins and Jones [8] presented a forward-induction approach 
in which an index is calculated for each arm depending only on 
the process of that arm, and the arm with the highest index at 
its current state is selected at each time. This result shows that 
arms can be decoupled when seeking the optimal activation 
rule, consequently reducing the complexity from exponential 
to linear in terms of the number of arms. Several researchers 
have since developed alternative proofs of the optimality of 
this approach, which has come to be known as the Gittins 
index [9]-[16]. Several variants of the basic classical Bayesian 
MAB have been proposed and investigated, including 
arm-acquiring bandits [17], superprocess bandits [9], [18], bandits 
with switching penalties [19], [20], and multiple simultaneous 
plays [21], [22]. 

A particularly important variant of the classic MAB is the 
restless bandit problem posed by Whittle in 1988 [23], in 



which the passive arms also change state (to model system 
dynamics that cannot be directly controlled). The structure of the 
optimal solution for this problem in general remains unknown, 
and the problem has been shown to be PSPACE-hard by Papadimitriou 
and Tsitsiklis [24]. Whittle proposed an index policy for 
this problem that is optimal under a relaxed constraint of 
an average number of arms played as well as asymptotically 
under certain conditions [25]; for many problems, this Whittle 
index policy has numerically been found to offer near-optimal 
performance. However, the Whittle index is not guaranteed to 
exist. Its existence (the so-called indexability) is difficult to 
check, and the index can be computationally expensive to 
calculate when it does exist. General analytical results on 
the optimality of the Whittle index in the finite regime have 
also eluded the research community to date. There are 
numerical approaches for testing indexability and calculating 
the Whittle index (see, for example, [26], [27]). Constant-factor 
approximation algorithms for restless bandits have also been 
explored in the literature [28], [29]. 

Among the recent work that contributes to the fundamental 
understanding of the basic structure of the optimal policies 
for a class of restless bandits with known models, it has been 
shown that the myopic policy [30]-[32] has a simple semi-universal 
round-robin structure. The myopic policy has been shown to be optimal 
for N = 2, 3, and for any N in the case of positively correlated 
channels. The optimality of the myopic policy for N > 3 
negatively correlated channels is conjectured for the 
infinite-horizon case. Our work provides the first efficient solution to 
the non-Bayesian version of this class of problems, making 
use of the semi-universal structure identified in [30]. 

B. Non-Bayesian MAB 

In the non-Bayesian formulation of MAB, the unknown 
parameters in the system dynamics are treated as deterministic 
quantities; no a priori probabilistic knowledge about the 
unknowns is required. The basic form of the problem is the 
optimal sequential activation of N independent arms, each as- 
sociated with an i.i.d. reward process with an unknown mean. 
The performance of an arm activation policy is measured by 
regret (also known as the cost of learning) defined as the 
difference between the total expected reward that could be 
obtained by an omniscient player that knows the parameters 
of the reward model and the policy in question (which has to 
learn these parameters through statistical observations). Notice 
that with the reward model known, the omniscient player will 
always activate the arm with the highest reward mean. The 
essence of the problem is thus to identify the best arm without 
exploring the bad arms too often in order to minimize the 
regret. In particular, it is desirable to have a sub-linear regret 
function with respect to time, as under this condition the time- 
averaged regret goes to zero, and the slower the regret growth 
rate, the faster the system converges to the same maximum 
average reward achievable under the known-model case. 

Lai and Robbins [33] proved in 1985 that the lower 
bound of regret is logarithmic in time, and proposed the first 
policy that achieved the optimal logarithmic regret for 
non-Bayesian MABs in which the rewards are i.i.d. over time and 
obtained from a distribution that can be characterized by a 
single parameter. Anantharam et al. extended this result to 
multiple simultaneous arm plays, as well as single-parameter 
Markovian rested rewards [34], [35]. Other policies achieving 
logarithmic regret under different assumptions about the i.i.d. 
reward model have been developed by Agrawal [36] and 
Auer et al. [37]. In particular, Auer et al.'s UCB1 policy 
applies to i.i.d. reward distributions with finite support, and 
achieves logarithmic regret with a known leading constant 
uniformly bounded over time. 
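
As an illustration of the index policies discussed above, the following is a minimal sketch of a UCB1-style rule on i.i.d. Bernoulli arms. The arm means (0.3 and 0.7), the horizon, and the helper names `ucb1` and `bernoulli` are assumptions chosen for the example; the exploration term uses the commonly quoted form sqrt(2 ln t / n_j) and is not taken from this paper.

```python
import math
import random

def ucb1(reward_fns, horizon, rng):
    """Run a UCB1-style index policy; return how often each arm was played."""
    n_arms = len(reward_fns)
    counts = [0] * n_arms   # number of plays of each arm
    sums = [0.0] * n_arms   # total reward collected from each arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once to initialise its estimate
        else:
            # index = sample mean + exploration bonus
            arm = max(
                range(n_arms),
                key=lambda j: sums[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        counts[arm] += 1
        sums[arm] += reward_fns[arm](rng)
    return counts

def bernoulli(p):
    """Hypothetical i.i.d. arm: reward 1 with probability p, else 0."""
    return lambda rng: 1.0 if rng.random() < p else 0.0

rng = random.Random(0)
counts = ucb1([bernoulli(0.3), bernoulli(0.7)], horizon=5000, rng=rng)
```

In this sketch the inferior arm is still sampled occasionally, but its share of the plays shrinks as the horizon grows, which is the behaviour a logarithmic regret bound describes.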

The focus of this paper is on the non-Bayesian RMAB. 
There are two parallel investigations on non-Bayesian RMAB 
problems given in [38], [39], where a more general RMAB 
model is considered but under a much weaker definition 
of regret. Specifically, in [38], [39], regret is defined with 
respect to the maximum reward that can be offered by a 
single arm/channel. Note that for RMAB with a known 
model, staying with the best arm is suboptimal in general. 
Thus, a sublinear regret under this definition does not imply 
the maximum average reward, and the deviation from the 
maximum average reward can be arbitrarily large. In contrast 
to these works, this paper shows sublinear regret with respect 
to the maximum reward that can be obtained by the optimal 
policy played by a genie that knows the underlying transition 
matrix. 

III. A New Approach for non-Bayesian RMAB 

In a multi-armed bandit problem, there are multiple arms and 
each of them yields a stochastic reward when played. The 
player sequentially picks one arm at each time, aiming to 
maximize the total expected reward collected over time. If 
the rewards on all arms are modeled as Markov chains and 
all arms keep evolving whether or not they are selected, the 
problem is classified as a restless multi-armed bandit problem (RMAB). 
In the Bayesian RMAB, the parameters of the Markov chain are 
known, and in the non-Bayesian RMAB, the model for the reward 
process is a priori unknown to the user. 

We first describe a structured class of finite-option Bayesian 
RMAB problems that we will refer to as $\Psi_m$. Let $B(P)$ be 
a Bayesian RMAB problem with the Markovian evolution of 
arms described by the transition matrix $P$. We say that 
$B(P) \in \Psi_m$ if and only if there exists a partition of the parameter 
values $P$ into a finite number $m$ of sets $\{S_1, S_2, \ldots, S_m\}$ and 
a set of policies $\pi_i$ ($\forall i = 1, \ldots, m$) with $\pi_i$ being optimal 
whenever $P \in S_i$. Despite the general hardness of the RMAB 
problem, problems with such structure do exist, as has been 
shown in S, JS^, lISTl . 

We propose a solution to the non-Bayesian version of the 
problem that leverages the finite solution option structure of 
the corresponding Bayesian version ($B(P) \in \Psi_m$). In this 
case, although the player does not know the exact parameter 
$P$, it must be true that one of the $m$ policies $\pi_i$ will yield 
the highest expected reward (corresponding to the set $S_i$ that 
contains the true, unknown $P$). These policies can thus be 
treated as arms in a different non-Bayesian multi-armed bandit 
problem for which a single-arm selection policy is optimal 
for the genie. Then, a suitable meta-policy that sequentially 



operates these policies while trying to minimize regret can be 
adopted. This can be done with an algorithm based on the 
well-known schemes proposed by Lai and Robbins [33] and 
Auer et al. [37]. 

One subtle issue that must be handled in adopting such an 
algorithm as a meta-policy is how long to play each policy. 
An ideal constant length of play could be determined only 
with knowledge of the underlying unknown parameters P. To 
circumvent this difficulty, our approach is to have the duration 
for which each policy is operated slowly increase over time. 

In the following, we demonstrate this novel meta-policy 
approach using the dynamic spectrum access problem discussed 
in [30], [31], where the Bayesian version of the RMAB has 
been shown to belong to the class $\Psi_2$. For this problem, we 
show that our approach yields a policy with provably near- 
logarithmic regret, thus achieving the same average reward 
offered by the optimal RMAB policy under a known model. 

IV. Dynamic Spectrum Access under Unknown 
Models 

We consider a slotted system where a secondary user is 
trying to access N independent channels, with the availability 
of each channel evolving as a two-state Markov chain with 
identical transition matrix P that is a priori unknown to the 
user. The user can only see the state of the sensed channel. 
If the user selects channel $i$ at time $t$, and upon sensing finds 
the state of the channel $S_i(t)$ to be 1, it receives a unit reward 
for transmitting. If it instead finds the channel to be busy, i.e., 
$S_i(t) = 0$, it gets no reward at that time. The user aims to 
maximize its expected total reward (throughput) over some 
time horizon by choosing judiciously a sensing policy that 
governs the channel selection in each slot. We are interested 
in designing policies that perform well with respect to regret, 
which is defined as the difference between the expected reward 
that could be obtained using the omniscient policy $\pi^*$ that 
knows the transition matrix $P$, and that obtained by the given 
policy $\pi$. The regret at time $n$ can be expressed as: 

$$R(P, \Omega(1), n) = E^{\pi^*}\Big[\sum_{t=1}^{n} Y^{\pi^*}(P, \Omega(1), t)\Big] - E^{\pi}\Big[\sum_{t=1}^{n} Y^{\pi}(P, \Omega(1), t)\Big], \tag{1}$$

where $\omega_i(1)$ is the initial probability that $S_i(1) = 1$, $P$ 
is the transition matrix of each channel, $Y^{\pi^*}(P, \Omega(1), t)$ 
is the reward obtained in time $t$ with the optimal policy, and 
$Y^{\pi}(P, \Omega(1), t)$ is the reward obtained in time $t$ with the given 
policy. We denote $\Omega(t) = [\omega_1(t), \ldots, \omega_N(t)]$ as the belief 
vector, where $\omega_i(t)$ is the conditional probability that $S_i(t) = 1$ 
(and let $\Omega(1) = [\omega_1(1), \ldots, \omega_N(1)]$ denote the initial belief 
vector used in the myopic sensing algorithm [30]). 

V. Policy Construction 

As has been shown in [30], the myopic policy has a simple 
structure for switching between channels that depends only 
on the correlation sign of the transition matrix $P$, i.e., whether 
$p_{11} > p_{01}$ (positively correlated) or $p_{11} < p_{01}$ (negatively 
correlated). 

In particular, if the channel is positively correlated, then the 
myopic policy corresponds to 

• Policy $\pi_1$: stay on a channel whenever it shows a "1" and 
switch on a "0" to the channel visited the longest ago. 

If the channel is negatively correlated, then it corresponds to 

• Policy $\pi_2$: stay on a channel when it shows a "0", 
and switch as soon as a "1" is observed, to either the 
channel most recently visited among those visited an even 
number of steps before, or if there are no such channels, 
to the one visited the longest ago. 

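To be more specific, we give the structure of myopic 
sensing [30]. In myopic sensing, the concept of circular order 
is very important. A circular order $\kappa = (n_1, n_2, \ldots, n_N)$ 
is equivalent to $(n_i, n_{i+1}, \ldots, n_N, n_1, n_2, \ldots, n_{i-1})$ for any 
$1 \le i \le N$. For a circular order $\kappa$, denote $-\kappa$ as its reverse 
circular order. For a channel $i$, denote $i^{+}_{\kappa}$ as its next channel 
in the circular order $\kappa$. With these notations, we present the 
structure of the myopic sensing. 

Let $\Omega(1) = [\omega_1(1), \ldots, \omega_N(1)]$ denote the initial belief 
vector. The circular order $\kappa(1)$ in time slot 1 depends on 
the order of $\Omega(1)$: $\kappa(1) = (n_1, n_2, \ldots, n_N)$ implies that 
$\omega_{n_1}(1) \ge \omega_{n_2}(1) \ge \cdots \ge \omega_{n_N}(1)$. Let $\hat{a}(t)$ denote the myopic 
action in time $t$. We have $\hat{a}(1) = \arg\max_{i = 1, 2, \ldots, N} \omega_i(1)$, and 
for $t > 1$, the myopic action $\hat{a}(t)$ is given as follows. 

• Policy $\pi_1$ ($p_{11} > p_{01}$): 

$$\hat{a}(t) = \begin{cases} \hat{a}(t-1), & \text{if } S_{\hat{a}(t-1)}(t-1) = 1 \\ \hat{a}(t-1)^{+}_{\kappa(t)}, & \text{if } S_{\hat{a}(t-1)}(t-1) = 0 \end{cases}$$

where $\kappa(t) \equiv \kappa(1)$. 

• Policy $\pi_2$ ($p_{11} < p_{01}$): 

$$\hat{a}(t) = \begin{cases} \hat{a}(t-1), & \text{if } S_{\hat{a}(t-1)}(t-1) = 0 \\ \hat{a}(t-1)^{+}_{\kappa(t)}, & \text{if } S_{\hat{a}(t-1)}(t-1) = 1 \end{cases}$$

where $\kappa(t) = \kappa(1)$ when $t$ is odd and $\kappa(t) = -\kappa(1)$ 
when $t$ is even. 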
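
The switching structure above can be sketched in a few lines. This is an illustrative reading only: it assumes that policy $\pi_1$ always follows the fixed circular order $\kappa(1)$, and that policy $\pi_2$ follows $\kappa(1)$ in odd slots and its reverse in even slots. The function names and the representation of a circular order as a Python list are hypothetical.

```python
def next_in_order(order, channel):
    """Return the channel that follows `channel` in the circular order `order`."""
    pos = order.index(channel)
    return order[(pos + 1) % len(order)]

def myopic_action(prev_action, prev_state, order1, t, positively_correlated):
    """Myopic action in slot t > 1.

    prev_action : channel sensed in slot t-1
    prev_state  : state (0 or 1) observed on that channel in slot t-1
    order1      : circular order kappa(1), as a list of channel ids
    """
    if positively_correlated:
        # pi_1: stay on a "1"; on a "0" move to the next channel in kappa(1)
        if prev_state == 1:
            return prev_action
        return next_in_order(order1, prev_action)
    # pi_2: stay on a "0"; on a "1" move to the next channel in kappa(t),
    # where kappa(t) is kappa(1) for odd t and its reverse for even t
    if prev_state == 0:
        return prev_action
    order_t = order1 if t % 2 == 1 else list(reversed(order1))
    return next_in_order(order_t, prev_action)
```

For example, with `order1 = [0, 1, 2]`, observing a "0" on channel 0 under $\pi_1$ moves the user to channel 1, while observing a "1" on channel 1 in an even slot under $\pi_2$ moves the user to channel 0.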

Furthermore, as mentioned in Section II, it has been shown 
in [30], [31] that the myopic policy is optimal for N = 2, 3. As 
a consequence, this special class of RMAB has the required 
finite dependence on its model as described in Section III; 
specifically, it belongs to $\Psi_2$. We can thus apply the general 
approach described in Section III. Specifically, the algorithm 
treats these two policies as arms in a classic non-Bayesian 
multi-armed bandit problem, with the goal of learning which 
one gives the higher reward. 

A key question is how long to operate each arm at each 
step. The analysis we present in the next section shows that it 
is desirable to slowly increase the duration of each step using 
any (arbitrarily slowly) divergent non-decreasing sequence of 
positive integers $\{K_n\}_{n=1}^{\infty}$. 

The channel sensing policy we thus construct is shown in 
Algorithm 1, in which we use the UCB1 policy proposed by 
Auer et al. in [37] as the meta-policy. 

VI. Regret Analysis 

We first define the discrete function $G(n)$, which represents 
the value of $K_i$ at the $n$-th time step in Algorithm 1: 

$$G(n) = \min_{I} K_I \quad \text{s.t.} \quad \sum_{i=1}^{I} K_i \ge n. \tag{2}$$
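
To make the definition of $G(n)$ concrete, the sketch below evaluates it for a given block-length sequence. The sequence `K_sqrt` (the integer part of the square root of the index) is only a hypothetical example of a slowly diverging non-decreasing sequence of positive integers; any other such sequence could be passed in.

```python
import math
from itertools import count

def G(n, K):
    """Return K(I) for the smallest I with K(1) + ... + K(I) >= n.

    K is a function mapping the block index i >= 1 to the block length K_i.
    """
    total = 0
    for i in count(1):
        total += K(i)
        if total >= n:
            return K(i)

def K_sqrt(i):
    """Hypothetical slowly diverging sequence: 1, 1, 1, 2, 2, 2, 2, 2, 3, ..."""
    return math.isqrt(i)
```

With this choice the cumulative block lengths are 1, 2, 3, 5, 7, ..., so `G(3, K_sqrt)` is 1 and `G(4, K_sqrt)` is 2: time step 4 falls in the fourth block, which has length 2.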



Algorithm 1 Sensing Policy for Unknown Dynamic Channels 
1: // Initialization 
2: Let $\{K_n\}_{n=1}^{\infty}$ be any arbitrarily slowly divergent 
   non-decreasing sequence of positive integers. 
3: Play policy $\pi_1$ for $K_1$ times, denote $\hat{A}_1$ as the sample 
   mean of these $K_1$ rewards 
4: Play policy $\pi_2$ for $K_2$ times, denote $\hat{A}_2$ as the sample 
   mean of these $K_2$ rewards 
5: $\hat{X}_1 = \hat{A}_1$, $\hat{X}_2 = \hat{A}_2$ 
6: $n = K_1 + K_2$ 
7: $i = 3$, $i_1 = 1$, $i_2 = 1$ 
8: // Main loop 
9: while 1 do 
10:   Find $j \in \{1, 2\}$ such that $j = \arg\max_{j} \frac{\hat{X}_j}{i_j} + \sqrt{\frac{L \ln n}{i_j}}$ 
      ($L$ can be any constant larger than 2) 
11:   $i_j = i_j + 1$ 
12:   Play policy $\pi_j$ for $K_i$ times, let $\hat{A}_j(i_j)$ record the 
      sample mean of these $K_i$ rewards 
13:   $\hat{X}_j = \hat{X}_j + \hat{A}_j(i_j)$ 
14:   $n = n + K_i$ 
15:   $i = i + 1$ 
16: end while 
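
The following is a small simulation sketch of the meta-policy idea in Algorithm 1, not the authors' implementation. It assumes N = 3 channels with p01 = 0.2 and p11 = 0.8 (positively correlated), the fixed circular order (0, 1, 2), block lengths K_i equal to the integer part of sqrt(i), and L = 3. The helper names (`step_channels`, `play_block`, `sensing_policy`) and the `env` dictionary are hypothetical.

```python
import math
import random

def step_channels(states, p01, p11, rng):
    """Advance every channel by one slot (restless: all channels evolve)."""
    return [1 if rng.random() < (p11 if s == 1 else p01) else 0 for s in states]

def play_block(policy, length, env, rng):
    """Run pi_1 (policy=1) or pi_2 (policy=2) for `length` slots.

    Returns the sample mean of the rewards collected in the block.
    """
    total = 0
    n_ch = len(env["states"])
    for _ in range(length):
        env["t"] += 1
        ch = env["channel"]
        obs = env["states"][ch]          # sensed state, also the reward
        total += obs
        stay_on = 1 if policy == 1 else 0
        if obs != stay_on:
            # pi_1 follows the fixed order; pi_2 uses the reverse order
            # whenever the next slot index is even
            forward = policy == 1 or (env["t"] + 1) % 2 == 1
            env["channel"] = (ch + (1 if forward else -1)) % n_ch
        env["states"] = step_channels(env["states"], env["p01"], env["p11"], rng)
    return total / length

def sensing_policy(env, K, horizon, L, rng):
    """UCB1-style meta-policy over the two myopic policies."""
    X = {1: play_block(1, K(1), env, rng), 2: play_block(2, K(2), env, rng)}
    plays = {1: 1, 2: 1}
    n, i = K(1) + K(2), 3
    while n < horizon:
        j = max((1, 2), key=lambda a: X[a] / plays[a]
                + math.sqrt(L * math.log(n) / plays[a]))
        plays[j] += 1
        X[j] += play_block(j, K(i), env, rng)  # add this block's sample mean
        n += K(i)
        i += 1
    return plays, n

rng = random.Random(1)
env = {"states": [rng.randint(0, 1) for _ in range(3)],
       "channel": 0, "t": 0, "p01": 0.2, "p11": 0.8}
plays, slots = sensing_policy(env, K=lambda i: math.isqrt(i),
                              horizon=20000, L=3.0, rng=rng)
```

Under these assumed parameters the channels are positively correlated, so the meta-policy should end up selecting $\pi_1$ for most blocks while still sampling $\pi_2$ occasionally.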



Note that since $K_i$ can be any arbitrarily slowly diverging non-decreasing 
sequence, $G(n)$ can also grow arbitrarily slowly. 

The following theorem states that the regret of our policy 
grows close to logarithmically with time. 

Theorem 1: For the dynamic spectrum access problem with 
N = 2, 3 i.i.d. channels with unknown transition matrix $P$, the 
expected regret with Algorithm 1 after $n$ time steps is at most 
$Z_1 G(n) \ln(n) + Z_2 \ln(n) + Z_3 G(n) + Z_4$, where $Z_1, Z_2, Z_3, Z_4$ 
are constants only related to $P$. 

The proof of Theorem 1 uses one fact and two lemmas. 

Fact 1: (Chernoff-Hoeffding bound [40]) Let $X_1, \ldots, X_n$ 
be random variables with common range [0, 1] such that 
$E[X_t \mid X_1, \ldots, X_{t-1}] = \mu$. Let $S_n = X_1 + \cdots + X_n$. Then 
for all $a \ge 0$, 

$$P\{S_n \ge n\mu + a\} \le e^{-2a^2/n}; \qquad P\{S_n \le n\mu - a\} \le e^{-2a^2/n}. \tag{3}$$

Our first lemma is a non-trivial generalization of the 
Chernoff-Hoeffding bound that allows for bounded differences 
between the conditional expectations of a sequence of 
random variables that are revealed sequentially: 

Lemma 1: Let $X_1, \ldots, X_n$ be random variables with range 
$[0, b]$ and such that $|E[X_t \mid X_1, \ldots, X_{t-1}] - \mu| \le C$, where $C$ is a 
constant such that $0 < C < \mu$. Let $S_n = X_1 + \cdots + X_n$. 
Then for all $a \ge 0$, 

$$P\{S_n \ge n(\mu + C) + a\} \le \exp\Big(-2 \Big(\frac{a(\mu - C)}{b(\mu + C)}\Big)^2 \Big/ n\Big) \tag{4}$$

and 

$$P\{S_n \le n(\mu - C) - a\} \le \exp\big(-2 (a/b)^2 / n\big). \tag{5}$$
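
As a numerical illustration of Lemma 1, the sketch below evaluates the two tail bounds, read here as $\exp(-2(a(\mu - C)/(b(\mu + C)))^2/n)$ for the upper tail and $\exp(-2(a/b)^2/n)$ for the lower tail. The function names and the sample values are assumptions made for the example.

```python
import math

def lemma1_upper_tail(n, a, b, mu, C):
    """Bound on P{S_n >= n(mu + C) + a}, in the form of (4)."""
    scaled = a * (mu - C) / (b * (mu + C))
    return math.exp(-2.0 * scaled ** 2 / n)

def lemma1_lower_tail(n, a, b):
    """Bound on P{S_n <= n(mu - C) - a}, in the form of (5)."""
    return math.exp(-2.0 * (a / b) ** 2 / n)

# Hypothetical values: 100 samples in [0, 1], drift C = 0.1 around mu = 0.5
upper = lemma1_upper_tail(n=100, a=10.0, b=1.0, mu=0.5, C=0.1)
lower = lemma1_lower_tail(n=100, a=10.0, b=1.0)
```

When $C$ tends to 0 and $b = 1$, both expressions reduce to the Chernoff-Hoeffding bound $e^{-2a^2/n}$ of Fact 1; for $C > 0$ the upper-tail bound is looser because the rescaled variables occupy a wider range.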



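Proof: 

We first prove (4). We generate random variables 
$\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_n$ as follows: 

$$\hat{X}_1 = (\mu + C) \frac{X_1}{E[X_1]}, \quad \hat{X}_2 = (\mu + C) \frac{X_2}{E[X_2 \mid X_1]}, \quad \ldots, \quad \hat{X}_t = (\mu + C) \frac{X_t}{E[X_t \mid X_1, \ldots, X_{t-1}]}.$$

Note that $|E[X_t \mid X_1, \ldots, X_{t-1}] - \mu| \le C$. Therefore 
$|E[X_t \mid \hat{X}_1, \ldots, \hat{X}_{t-1}] - \mu| \le C$ also stands. Hence $\hat{X}_t / X_t$ is 
at least 1 and at most $\frac{\mu + C}{\mu - C}$. Therefore $\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_n$ have 
finite support (they are in the range $[0, b\frac{\mu + C}{\mu - C}]$). Besides, 

$$E[\hat{X}_t \mid \hat{X}_1, \ldots, \hat{X}_{t-1}] = \mu + C, \quad \forall t.$$

Let $\hat{S}_n = \hat{X}_1 + \hat{X}_2 + \cdots + \hat{X}_n$; then for all $a \ge 0$, 

$$P\{S_n \ge n(\mu + C) + a\} \le P\{\hat{S}_n \ge n(\mu + C) + a\} \le \exp\Big(-2 \Big(\frac{a(\mu - C)}{b(\mu + C)}\Big)^2 \Big/ n\Big). \tag{6}$$

The first inequality stands because $\hat{X}_t / X_t \ge 1, \forall t$. The second 
inequality stands because of Fact 1. 

The proof of (5) is similar. We generate random variables 
$X'_1, X'_2, \ldots, X'_n$ as follows: 

$$X'_1 = (\mu - C) \frac{X_1}{E[X_1]}, \quad X'_2 = (\mu - C) \frac{X_2}{E[X_2 \mid X_1]}, \quad \ldots, \quad X'_t = (\mu - C) \frac{X_t}{E[X_t \mid X_1, \ldots, X_{t-1}]}.$$

Note that $|E[X_t \mid X_1, \ldots, X_{t-1}] - \mu| \le C$, so 
$|E[X_t \mid X'_1, \ldots, X'_{t-1}] - \mu| \le C$ also stands. So $X'_t / X_t$ is 
at most 1 and at least $\frac{\mu - C}{\mu + C}$. Therefore, $X'_1, X'_2, \ldots, X'_n$ 
have finite support (they are in the range $[0, b]$). Besides, 

$$E[X'_t \mid X'_1, \ldots, X'_{t-1}] = \mu - C, \quad \forall t.$$

Let $S'_n = X'_1 + X'_2 + \cdots + X'_n$; then for all $a \ge 0$, 

$$P\{S_n \le n(\mu - C) - a\} \le P\{S'_n \le n(\mu - C) - a\} \le \exp\big(-2 (a/b)^2 / n\big). \tag{7}$$

The first inequality stands because $X'_t / X_t \le 1, \forall t$. The second 
inequality stands because of Fact 1. ■ 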

The second lemma states that the expected loss of reward 
for either policy due to starting with an arbitrary initial belief 
vector, compared to the reward $U_i(P)$ that would be obtained by 
playing the policy at steady state, is bounded by a constant 
$C_i(P)$ that depends only on the policy used and the transition 
matrix. 

Specifically, denote 

$$U_i(P, \Omega(1)) = \lim_{T \to \infty} \frac{E^{\pi_i}\big[\sum_{t=1}^{T} Y^{\pi_i}(P, \Omega(1), t)\big]}{T}. \tag{8}$$

From the previous work in [5], we know that the limit 
above exists and the steady-state throughput $U_i(P, \Omega(1))$ is 
independent of the initial belief value $\Omega(1)$. So, we can rewrite 
$U_i(P, \Omega(1))$ as $U_i(P)$, to denote the average expected reward 
with transition matrix $P$ using policy $\pi_i$ ($i = 1, 2$). 

Lemma 2: For any initial belief vector $\Omega(1)$ and any positive 
integer $M$, if we use policy $\pi_i$ ($i = 1, 2$) for $M$ times, 
and the summed expectation of the rewards for these $M$ steps 
is denoted as $E^{\pi_i}[\sum_{t=1}^{M} Y^{\pi_i}(P, \Omega(1), t)]$, then 

$$\Big| E^{\pi_i}\Big[\sum_{t=1}^{M} Y^{\pi_i}(P, \Omega(1), t)\Big] - M \cdot U_i(P) \Big| \le C_i(P). \tag{9}$$

Proof: 

As described in Section V, let $\kappa(t) = (n_1, n_2, \ldots, n_N)$ 
($n_i \in \{1, 2, \ldots, N\}, \forall i$) be the circular order of channels 
in slot $t$. We then have an ordered channel state vector 
$S(t) = [S_{n_1}(t), S_{n_2}(t), \ldots, S_{n_N}(t)]$. In fact, after sensing at time $t-1$, 
the channel sequence $\{n_1, n_2, \ldots, n_N\}$ has a non-increasing 
probability of being in state 1, i.e., 
$\omega_{n_1}(t) \ge \omega_{n_2}(t) \ge \cdots \ge \omega_{n_N}(t)$. 

The states $S(t)$ form a $2^N$-state Markov chain with transition 
probabilities $\{Q_{i,j}\}$ shown as follows: 
for policy $\pi_1$, 

$$Q_{i,j} = \begin{cases} \prod_{k=1}^{N} p_{i_k, j_k}, & \text{if } i_1 = 1 \\ p_{i_1, j_N} \prod_{k=2}^{N} p_{i_k, j_{k-1}}, & \text{if } i_1 = 0 \end{cases} \tag{10}$$

and for policy $\pi_2$, 

$$Q_{i,j} = \begin{cases} \prod_{k=1}^{N} p_{i_k, j_{N-k+1}}, & \text{if } i_1 = 1 \\ p_{i_1, j_1} \prod_{k=2}^{N} p_{i_k, j_{N-k+2}}, & \text{if } i_1 = 0 \end{cases} \tag{11}$$

where $i = [i_1, i_2, \ldots, i_N]$ and $j = [j_1, j_2, \ldots, j_N]$ are two 
ordered channel state vectors with entries equal to 0 or 1. 
Denote the probability vector of the Markov chain formed by 
$S(t)$ as $V(t) = [v_1(t), v_2(t), \ldots, v_{2^N}(t)]$. Then we have 

$$V(t) = V(1) \cdot Q^{t-1}. \tag{12}$$

In the $t$-th step, the myopic policy selects the channel of the 
first component in $S(t)$; therefore we only have to calculate 
the probability of states whose first component is 1. There 
are $2^{N-1}$ such states; denoting them as $l_1, l_2, \ldots, l_{2^{N-1}}$, we 
have 

$$E[Y^{\pi_i}(P, \Omega(1), t)] = \sum_{m=1}^{2^{N-1}} v_{l_m}(t). \tag{13}$$

We can diagonalize $Q$ as below: 

$$Q = H^{-1} \, \mathrm{diag}(\lambda_1, \ldots, \lambda_{2^N}) \, H. \tag{14}$$

From the Perron-Frobenius theorem [41], we know that $|\lambda_j| \le 1, \forall j$, 
and at least one eigenvalue is 1. Without loss of generality, 
we assume $1 = \lambda_1 = \cdots = \lambda_a > |\lambda_{a+1}| \ge \cdots \ge |\lambda_{2^N}|$. 
Combining (12), (13) and (14), we can rewrite $E[Y^{\pi_i}(P, \Omega(1), t)]$ as 

$$E[Y^{\pi_i}(P, \Omega(1), t)] = \sum_{j=1}^{2^N} h_j \lambda_j^{t-1} = \sum_{j=1}^{a} h_j + \sum_{j=a+1}^{2^N} h_j \lambda_j^{t-1}, \tag{15}$$

where $h_j$ is the corresponding coefficient, which is only related 
to $P$. 

The steady average throughput $U_i(P)$ ($i = 1, 2$) is $\sum_{j=1}^{a} h_j$. 
For different policies, the transition matrix $Q$ is different, 
and thus the eigenvalues $\lambda_j$ and the coefficients $h_j$ have different expressions. 
We denote the resulting throughputs as $U_1(P)$ and $U_2(P)$ respectively. 

For the same reason, for different policies we have 
different expressions for $\sum_{j=a+1}^{2^N} \frac{|h_j|}{1 - |\lambda_j|}$, denoted as 
$C_1(P)$ and $C_2(P)$ respectively. 

Besides, we have: 

$$\Big| E^{\pi_i}\Big[\sum_{t=1}^{M} Y^{\pi_i}(P, \Omega(1), t)\Big] - M \cdot U_i(P) \Big| = \Big| \sum_{t=1}^{M} \sum_{j=a+1}^{2^N} h_j \lambda_j^{t-1} \Big| \le \sum_{j=a+1}^{2^N} |h_j| \sum_{t=1}^{M} |\lambda_j|^{t-1} < \sum_{j=a+1}^{2^N} |h_j| \sum_{t=1}^{\infty} |\lambda_j|^{t-1} = \sum_{j=a+1}^{2^N} \frac{|h_j|}{1 - |\lambda_j|}. \tag{16}$$

So $|E^{\pi_i}[\sum_{t=1}^{M} Y^{\pi_i}(P, \Omega(1), t)] - M \cdot U_i(P)| \le C_i(P)$, $i = 1, 2$. ■ 
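
The ordered-state Markov chain used in this proof can be explored numerically. The sketch below is an illustration under stated assumptions rather than the paper's own computation: it assumes that under $\pi_1$ the ordered positions are kept on a "1" and rotated (first position to the end) on a "0", and that under $\pi_2$ the first position is kept and the rest reversed on a "0", while the whole order is reversed on a "1". The function names and the time-averaging estimator of the steady-state throughput are hypothetical.

```python
import math
from itertools import product

def build_Q(p01, p11, N, policy):
    """Transition matrix of the ordered state vector S(t) under pi_1 or pi_2."""
    p = {(0, 0): 1 - p01, (0, 1): p01, (1, 0): 1 - p11, (1, 1): p11}
    states = list(product((0, 1), repeat=N))
    Q = []
    for i in states:
        if policy == 1:
            # stay on "1": positions unchanged; otherwise rotate the order
            dest = list(range(N)) if i[0] == 1 else [N - 1] + list(range(N - 1))
        else:
            # stay on "0": first fixed, rest reversed; otherwise full reversal
            dest = ([0] + list(range(N - 1, 0, -1)) if i[0] == 0
                    else list(range(N - 1, -1, -1)))
        Q.append([math.prod(p[(i[k], j[dest[k]])] for k in range(N))
                  for j in states])
    return states, Q

def average_throughput(p01, p11, N, policy, T=5000):
    """Time-averaged probability that the sensed (first) channel is in state 1."""
    states, Q = build_Q(p01, p11, N, policy)
    size = len(states)
    v = [1.0 / size] * size   # start from the uniform distribution
    total = 0.0
    for _ in range(T):
        total += sum(v[s] for s in range(size) if states[s][0] == 1)
        v = [sum(v[r] * Q[r][c] for r in range(size)) for c in range(size)]
    return total / T

U1 = average_throughput(0.2, 0.8, 2, policy=1)
U2 = average_throughput(0.2, 0.8, 2, policy=2)
```

For the assumed values N = 2, p01 = 0.2 and p11 = 0.8, this gives a throughput near 0.65 for $\pi_1$ and near 0.35 for $\pi_2$, in line with $\pi_1$ being the better policy for positively correlated channels.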
Proof of Theorem 1: 

We first derive a bound on the regret for the case when 
$p_{01} < p_{11}$. In this case, policy $\pi_1$ would be the optimal. Based 
on Lemma 2, the difference of $E^{\pi_1}[\sum_{t=1}^{n} Y^{\pi_1}(P, \Omega(1), t)]$ 
and $n U_1$ is no more than $C_1$; therefore we only need to prove: 

$$R'(P, \Omega(1), n) \triangleq n U_1 - E^{\pi}\Big[\sum_{t=1}^{n} Y^{\pi}(P, \Omega(1), t)\Big] \le Z_1 G(n) \ln(n) + Z_2 \ln(n) + Z_3 G(n) + Z_4 - C_1,$$

where $Z_1, Z_2, Z_3, Z_4$ are constants only related to $P$. 

The regret comes from two parts: the regret when using 
policy $\pi_2$, and the regret between $U_1$ and $Y^{\pi_1}(P, \Omega(1), t)$ when 
using policy $\pi_1$. From Lemma 2, we know that each time 
we switch from policy $\pi_1$ to policy $\pi_2$, we lose at most 
a constant-level value from the second part. So if the number of times 
policy $\pi_2$ is used is bounded by $O(G(n) \ln n)$, both parts 
of the regret can be bounded by $O(G(n) \ln n)$. 

For ease of exposition, we discuss the slots $n$ such that 
$G \| n$, where $G \| n$ denotes that time $n$ is the end of successive 
$G(n)$ plays. 

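We define $q$ as the smallest index such that 

$$K_q > \max\Big\{ \Big\lceil \frac{C_1 + C_2}{|U_1 - U_2|} \Big\rceil, \ \frac{C_2}{U_2}, \ \frac{C_1}{U_1} \Big\}. \tag{17}$$

Let $c_{t,s} = \sqrt{(L \ln t)/s}$, $w_1 = q\,(U_1 - C_1/K_q)$ and 

$$w_2 = q \, \frac{U_2 - C_2/K_q}{U_2 + C_2/K_q} \Big( U_2 + \frac{C_2}{K_q} - 1 \Big).$$

Next we will show that it is possible to define $\alpha(U_1, C_1, P)$ 
such that if policy $\pi_1$ is played $s > \alpha$ times, 

$$\exp\big(-2 (w_1 - s\, c_{t,s})^2 / (s - q)\big) \le t^{-4}. \tag{18}$$

In fact, we have 

$$\sqrt{L s} - w_1 > \sqrt{2 (s - q)}$$

when $s > \max\{q, \lceil w_1 / (\sqrt{L} - \sqrt{2}) \rceil^2\}$. 
Consider 

$$f(t) = \sqrt{L s \ln t} - w_1 - \sqrt{2 (s - q) \ln t}, \quad \forall t \ge e.$$

It is quite obvious that $f(t)$ is an increasing function. Since 
$f(e) > 0$, we have $f(t) > 0, \forall t \ge e$, i.e., 

$$\sqrt{L s \ln t} - w_1 > \sqrt{2 (s - q) \ln t},$$

which is equivalent to 

$$\exp\big(-2 (w_1 - s\, c_{t,s})^2 / (s - q)\big) \le t^{-4}.$$

Thus we can at least set 

$$\alpha(U_1, C_1, P) = \max\{q, \lceil w_1 / (\sqrt{L} - \sqrt{2}) \rceil^2\}.$$

For a similar reason, we could define 

$$\beta(U_2, C_2, P) = \max\{q, \lceil w_2 / (\sqrt{L} - \sqrt{2}) \rceil^2\}$$

such that if policy $\pi_2$ is played $s > \beta$ times, 

$$\exp\big(-2 (w_2 + s\, c_{t,s})^2 / (s - q)\big) \le t^{-4}. \tag{19}$$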

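Moreover, we will show that there exists 
$\gamma = \lceil \max\{5\alpha + 1, e^{4\alpha/L} + \alpha, 5\beta + 1, e^{4\beta/L} + \beta\} \rceil$ such that 
when $G(n) > K_\gamma$, policy $\pi_1$ is played at least $\alpha$ times and 
policy $\pi_2$ is played at least $\beta$ times. 

In fact, if policy $\pi_1$ has been played less than $\alpha$ times, then 
policy $\pi_2$ has been played at least $(4\alpha + 2)$ times. Consider 
the last time policy $\pi_2$ was selected; there must be 

$$\frac{\hat{X}_1}{i_1} + c_{t,i_1} \le \frac{\hat{X}_2}{i_2} + c_{t,i_2}. \tag{20}$$

Noting that $\hat{X}_1 / i_1 \ge 0$, $\hat{X}_2 / i_2 \le 1$, $i_1 \le \alpha - 1$, $i_2 \ge 4\alpha + 1$, we 
have 

$$\sqrt{\frac{L \ln t}{\alpha - 1}} \le 1 + \sqrt{\frac{L \ln t}{4\alpha + 1}}.$$

Consider 

$$g(t) = 1 + \sqrt{\frac{L \ln t}{4\alpha + 1}} - \sqrt{\frac{L \ln t}{\alpha - 1}}.$$

Since $g(t)$ is a decreasing function and $t \ge \gamma - \alpha \ge e^{4\alpha/L}$, 
we have 

$$g(t) \le g(e^{4\alpha/L}) = 1 + \sqrt{\frac{4\alpha}{4\alpha + 1}} - \sqrt{\frac{4\alpha}{\alpha - 1}} < 0,$$

which contradicts the conclusion above. So policy $\pi_1$ has been 
played at least $\alpha$ times. For a similar reason, policy $\pi_2$ is 
played at least $\beta$ times. 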

Denote T{n) as the number of times we select policy 7r2 
up to time n. Then, for any positive integer /, we have 



r(n) = 1 + 



^ ' hit) 

t=Ki+K2,G\\t ^ ' 



,rXlit) Mt) 



Ct,i2(t)} 

< ; + 7+ 

E E E 

t=A'l^ ^Ky,G\\t si=Q S2=max{P,l) 

H 1- Ct,si < h Ct,s2} 



Si 



S2 



(21) 

where l{x} is the index function defined to be 1 when the 
predicate x is true, and when it is false predicate; ijit) is 
the number of times we select policy when up to time 
Vj = 1,2; Xj{t) is the sum of every sample mean for Ki 
plays up to time t\ Xi si is the sum of every sample mean for 
Kg. times using policy 7r,j. 

01.82} implies that 



The condition {— !^ + Ct.si < 



X2,: 



at least one of the following must hold: 

X\ G\ 

' <Ul- — Ct.si 



Si 



K„ 



(22) 



7 



-^>U2 + ^+ ' r (23) 

S2 J<q U2-C2/Kq 

Cl C2 , U2+C2lK„, 

1 - ^ < C/2 + # + (1 + ' )CM. (24) 



K, 



K. 



U2-C2/K,' 



Note that 



(25) 



where Ai i is sample average reward for the ifh time selecting 
policy TTi. 

Due to the definition of a and Kg, we have 

C/i - < E[ii.] < C/i + V^><z (26) 
-Kg Ag 

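Then applying Lemma 1 and the results in (18) and (19),

$$
\begin{aligned}
P\left( \frac{\hat{X}_{1,s_1}}{s_1} \le U_1 - \frac{C_1}{K_\alpha} - C_{t,s_1} \right)
&= P\left( \frac{A_{1,1} + A_{1,2} + \cdots + A_{1,s_1}}{s_1} \le U_1 - \frac{C_1}{K_\alpha} - C_{t,s_1} \right) \\
&\le P\left( \frac{0 + \cdots + 0 + A_{1,\alpha} + \cdots + A_{1,s_1}}{s_1} \le U_1 - \frac{C_1}{K_\alpha} - C_{t,s_1} \right) \\
&\le \exp\left( -\frac{2\,(\cdots - s_1 C_{t,s_1})^2}{s_1 - \cdots} \right)
\end{aligned} \tag{27}
$$

Similarly,

$$
\begin{aligned}
P\left( \frac{\hat{X}_{2,s_2}}{s_2} \ge U_2 + \frac{C_2}{K_\alpha} + \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha}\, C_{t,s_2} \right)
&= P\left( \frac{A_{2,1} + A_{2,2} + \cdots + A_{2,s_2}}{s_2} \ge U_2 + \frac{C_2}{K_\alpha} + \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha}\, C_{t,s_2} \right) \\
&\le P\left( \frac{1 + \cdots + 1 + A_{2,\alpha} + \cdots + A_{2,s_2}}{s_2} \ge U_2 + \frac{C_2}{K_\alpha} + \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha}\, C_{t,s_2} \right) \\
&\le \exp\left( -\frac{2\,(\cdots + s_2 C_{t,s_2})^2}{s_2 - \alpha} \right)
\end{aligned} \tag{28}
$$

Denote $\lambda(n)$ as

$$
\lambda(n) = \left\lceil \frac{L \left( 1 + \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha} \right)^2 \ln n}{\left( U_1 - U_2 - \frac{C_1 + C_2}{K_\alpha} \right)^2} \right\rceil \tag{29}
$$

For $l \ge \lambda(n)$, (24) is false. So we get:

$$
\begin{aligned}
E(T(n)) &\le \lambda(n) + \gamma + \sum_{t=1}^{\infty} \sum_{s_1=1}^{t} \sum_{s_2=1}^{t} 2 t^{-4} \\
&\le \lambda(n) + \gamma + \frac{\pi^2}{3}
\end{aligned} \tag{30}
$$

Therefore, we have:

$$
R(P, \Omega(1), n) \le G(n) + \big( (U_1 - U_2)\, G(n) + C_1 + C_2 + C_1 \big) \left( \lambda(n) + \gamma + \frac{\pi^2}{3} \right) \tag{31}
$$

This concludes the bound in the case $p_{11} > p_{01}$. The derivation of the bound is similar for the case $p_{11} < p_{01}$, with the key difference that $\gamma'$ takes the place of $\gamma$, and the $C_1, U_1$ terms are replaced by $C_2, U_2$ and vice versa. Then the regret in either case has the following bound:

$$
\begin{aligned}
R(P, \Omega(1), n) \le{}& G(n) + \big( |U_2 - U_1|\, G(n) + C_1 + C_2 + \max\{C_1, C_2\} \big) \\
&\times \left( \frac{L \left( 1 + \max\left\{ \frac{U_1 + C_1/K_\alpha}{U_1 - C_1/K_\alpha},\ \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha} \right\} \right)^2 \ln n}{\left( |U_2 - U_1| - \frac{C_1 + C_2}{K_\alpha} \right)^2} + 1 + \max\{\gamma, \gamma'\} + \frac{\pi^2}{3} \right) \\
&+ \max\{C_1, C_2\}
\end{aligned} \tag{32}
$$

This inequality can be readily translated to the simplified form of the bound given in the statement of Theorem 1, where:

$$
Z_1 = |U_2 - U_1| \, \frac{L \left( 1 + \max\left\{ \frac{U_1 + C_1/K_\alpha}{U_1 - C_1/K_\alpha},\ \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha} \right\} \right)^2}{\left( |U_2 - U_1| - \frac{C_1 + C_2}{K_\alpha} \right)^2}
$$

$$
Z_2 = \big( C_1 + C_2 + \max\{C_1, C_2\} \big) \, \frac{L \left( 1 + \max\left\{ \frac{U_1 + C_1/K_\alpha}{U_1 - C_1/K_\alpha},\ \frac{U_2 + C_2/K_\alpha}{U_2 - C_2/K_\alpha} \right\} \right)^2}{\left( |U_2 - U_1| - \frac{C_1 + C_2}{K_\alpha} \right)^2}
$$

$$
Z_3 = |U_2 - U_1| \left( 1 + \max\{\gamma, \gamma'\} + \frac{\pi^2}{3} \right) + 1
$$

$$
Z_4 = \big( C_1 + C_2 + \max\{C_1, C_2\} \big) \left( 1 + \max\{\gamma, \gamma'\} + \frac{\pi^2}{3} \right) + \max\{C_1, C_2\}
$$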
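To get a feel for the threshold $\lambda(n)$ used in this step, the following sketch evaluates it numerically. It assumes the reading $\lambda(n) = \lceil L (1 + r)^2 \ln n / (U_1 - U_2 - (C_1 + C_2)/K_\alpha)^2 \rceil$ with $r = (U_2 + C_2/K_\alpha)/(U_2 - C_2/K_\alpha)$; the function name and the numbers in the usage note are hypothetical and not taken from the paper.

```python
import math


def lambda_n(n: int, U1: float, U2: float, C1: float, C2: float,
             K_alpha: float, L: float = 3.0) -> int:
    """Evaluate the assumed form of lambda(n) for the case U1 > U2."""
    ratio = (U2 + C2 / K_alpha) / (U2 - C2 / K_alpha)
    gap = U1 - U2 - (C1 + C2) / K_alpha
    if gap <= 0:
        # The bound needs K_alpha > (C1 + C2) / (U1 - U2).
        raise ValueError("K_alpha is too small for the gap to be positive")
    return math.ceil(L * (1 + ratio) ** 2 * math.log(n) / gap ** 2)
```

For example, `lambda_n(1000, 0.6, 0.5, 1.0, 1.0, 100)` grows only logarithmically as `n` increases, which is where the $\ln n$ factor of the regret bound comes from. The `ValueError` branch mirrors the requirement that $K_\alpha$ be large enough for the denominator to be positive.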



Remark 1: Theorem 1 has been stated for the cases $N = 2, 3$, which are the cases in which the myopic policy has been proved to be optimal in the known-parameter setting for all values of $P$. However, our proof shows something stronger: Algorithm 1 yields the claimed near-logarithmic regret with respect to the myopic policy for any $N$. The myopic policy is known to be always optimal for $N = 2, 3$, and for any $N$ so long as the Markov chain is positively correlated. For negatively correlated channels, it is an open question whether it is optimal in the infinite-horizon case (extensive numerical examples suggest an affirmative answer to this conjecture). If this conjecture is true, the policy we have presented would also offer near-logarithmic regret for any $N$.

Remark 2: Theorem 1 also holds if the initial belief vector is unknown. This is because once every channel has been sensed once, the myopic policy no longer depends on the initial belief, and this must happen within finite time on average.
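As a reference point for the remarks above, an epoch-based meta-policy of the kind analyzed here can be sketched as follows. This is a hedged illustration and not the paper's Algorithm 1 verbatim: the index form (average of sample means plus $\sqrt{L \ln t / i_j}$), the callable interface for a policy, and all names are assumptions.

```python
import math


def run_meta_policy(policies, epochs, epoch_length, L=3.0):
    """Treat each candidate policy as an arm and pick one per epoch.

    policies: list of callables; policies[j](k) runs policy j for k slots
              and returns the total reward collected (hypothetical interface).
    epoch_length: callable n -> K_n, the number of slots in epoch n.
    Returns the per-epoch choices and the selection counts.
    """
    counts = [0] * len(policies)        # i_j: times policy j was selected
    mean_sums = [0.0] * len(policies)   # sum of per-epoch sample means
    slots_used = 0                      # current time t
    choices = []
    for n in range(1, epochs + 1):
        k = epoch_length(n)
        untried = [j for j, c in enumerate(counts) if c == 0]
        if untried:
            j = untried[0]              # play every policy once first
        else:
            t = slots_used
            j = max(
                range(len(policies)),
                key=lambda a: mean_sums[a] / counts[a]
                + math.sqrt(L * math.log(t) / counts[a]),
            )
        reward = policies[j](k)
        mean_sums[j] += reward / k      # store the sample mean of this epoch
        counts[j] += 1
        slots_used += k
        choices.append(j)
    return choices, counts
```

A call such as `run_meta_policy([play_pi1, play_pi2], 200, lambda n: 100)` (with hypothetical `play_pi1` and `play_pi2`) would concentrate its selections on whichever policy has the higher average reward, while still revisiting the other one at a slowly decreasing rate.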

VII. Examples And Simulation Results 

We consider a system that consists of $N = 2$ independent channels. Each channel evolves as a two-state Markov chain with transition probability matrix $P$. The parameter $L$ is set to 3. We consider several settings with different sequences $\{K_n\}_{n=1}^{\infty}$ and different correlations.

First we show the simulation results when the channels are positively correlated. The transition probabilities are as follows: $p_{01} = 0.3$, $p_{11} = 0.7$, $p_{10} = 0.3$, $p_{00} = 0.7$. The sequences are set to be $K_n^{(1)} = [100 + \ln(n + 2)]$, $K_n^{(2)} = [100 + \ln(\ln(n + 2))]$ and $K_n^{(3)} = [100 + \ln(\ln(\ln(n + 2)))]$.
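The setup described above can be reproduced with a short sketch. Two assumptions are made here: the brackets in the definition of $K_n$ are read as a ceiling, and each channel is simulated directly from $p_{01}$ and $p_{11}$. The function names are hypothetical.

```python
import math
import random


def k_sequence(n: int, depth: int) -> int:
    """K_n = ceil(100 + ln(...ln(n + 2)...)) with `depth` nested logarithms (n >= 1)."""
    value = float(n + 2)
    for _ in range(depth):
        value = math.log(value)
    return math.ceil(100 + value)


def simulate_channel(p01: float, p11: float, steps: int,
                     rng: random.Random, state: int = 0) -> list:
    """Simulate a two-state Markov channel; state 1 is 'good', state 0 is 'bad'."""
    states = []
    for _ in range(steps):
        p_good = p11 if state == 1 else p01   # probability of moving to state 1
        state = 1 if rng.random() < p_good else 0
        states.append(state)
    return states
```

With $p_{01} = 0.3$ and $p_{11} = 0.7$ the long-run fraction of good slots is $p_{01}/(1 - p_{11} + p_{01}) = 0.5$, which gives a quick sanity check on the simulated trace.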
Fig. 1. Simulation results when $K_n = [100 + \ln(n + 2)]$

Fig. 2. Simulation results when $K_n = [100 + \ln(\ln(n + 2))]$

Fig. 3. Simulation results when $K_n = [100 + \ln(\ln(\ln(n + 2)))]$

Fig. 4. Simulation results when $K_n = [100 + \ln(n + 2)]$



Figures 1 to 3 show the simulation results (normalized with respect to $G(t) \log t$). It is quite clear that the regrets all converge to a limit that is below our bound. In our simulations, $K_1$ is already greater than $\left\lceil \frac{C_1 + C_2}{|U_1 - U_2|} \right\rceil$. In this way, the regret can converge more quickly. In practice, it may happen that $K_1 < \left\lceil \frac{C_1 + C_2}{|U_1 - U_2|} \right\rceil$. Then we have to wait for some time until $K_n$ is sufficiently large. Since $K_n$ goes to infinity, it will exceed $\left\lceil \frac{C_1 + C_2}{|U_1 - U_2|} \right\rceil$ after a certain time. The speed of convergence depends on how fast $K_n$ grows. If $K_n$ grows too slowly, it may take a longer time to converge; however, if it grows too fast, then although the regret converges quickly, the upper bound on the regret also increases. So there is a trade-off here between convergence speed and the upper bound on the regret. Generally $K_n$ should be a sub-linear sequence.
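The waiting time mentioned above can be made concrete with a small sketch that searches for the first epoch at which $K_n$ exceeds a given threshold. The ceiling reading of the brackets in $K_n$, the search cap `max_n`, and the function names are assumptions made for illustration.

```python
import math


def k_sequence(n: int, depth: int) -> int:
    """K_n = ceil(100 + ln(...ln(n + 2)...)) with `depth` nested logarithms (n >= 1)."""
    value = float(n + 2)
    for _ in range(depth):
        value = math.log(value)
    return math.ceil(100 + value)


def first_epoch_exceeding(threshold: float, depth: int, max_n: int = 10**6):
    """Return the smallest n <= max_n with K_n > threshold, or None if there is none."""
    for n in range(1, max_n + 1):
        if k_sequence(n, depth) > threshold:
            return n
    return None
```

Because each extra nested logarithm slows the growth of $K_n$ dramatically, a threshold that the first sequence passes within a few epochs may not be reached by the second within any practical horizon, which is the slow-growth side of the trade-off described above.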

Next, we show the simulation results when the channels are negatively correlated. The transition probabilities are as follows: $p_{01} = 0.7$, $p_{11} = 0.3$, $p_{10} = 0.7$, $p_{00} = 0.3$. We again use the sequences $K_n^{(1)}$, $K_n^{(2)}$ and $K_n^{(3)}$.

Figures 4 to 6 show the simulation results (normalized with respect to the product of $G(t)$ and the logarithm of time). The regrets also converge to a limit and are bounded. The basic conclusion also holds here.



Fig. 5. Simulation results when $K_n = [100 + \ln(\ln(n + 2))]$

Fig. 6. Simulation results when $K_n = [100 + \ln(\ln(\ln(n + 2)))]$

VIII. Conclusion

In this study, we have considered the non-Bayesian RMAB problem, in which the parameters of the Markov chain are unknown a priori. We have developed a novel approach to solve special cases of this problem that applies whenever the corresponding known-parameter Bayesian problem has the structure that the optimal solution is one of a prescribed finite set of policies, depending on the known parameters. For such problems, we propose to learn the optimal policy by using a meta-policy which treats each policy from that finite set as an arm. We have demonstrated our approach by developing an original policy for opportunistic spectrum access over unknown dynamic channels. We have proved that our policy achieves near-logarithmic regret in this case, which is the first such strong regret result in the literature for a non-Bayesian RMAB problem.

While we have demonstrated this meta-policy approach for a particular RMAB with two states and identical arms, an open question is to identify other RMAB problems that fall into the finite option structure, and to derive similar results for them. Note that even for a problem where the optimal solution does not have a finite option structure, but there exists a near-optimal policy that has this structure, our approach could be used to prove sub-linear regret with respect to the near-optimal policy.

Appendix A

Calculation of $C_i(P)$ and $U_i(P)$ for Lemma 2 when $N = 2$

When $N = 2$, we explicitly calculate $C_1(P)$, $C_2(P)$, $U_1(P)$, $U_2(P)$ as follows:

$$
U_1(P) = \frac{p_{01}^2}{(1 - p_{11} + p_{01})^2} + \frac{(1 - p_{01} + p_{11})\, p_{01} (1 - p_{11})}{(1 - p_{11} + p_{01})^2} \tag{33}
$$

$$
U_2(P) = \frac{p_{01}^2}{(1 - p_{11} + p_{01})^2} + \frac{p_{01} (1 - p_{11})}{1 - p_{11} + p_{01}} \tag{34}
$$

$$
C_1(P) = \frac{2 \max\{p_{01},\, 1 - p_{11}\}\, |p_{11} - p_{01}|\, (1 - p_{11})}{(1 - p_{11} + p_{01})^3} + \frac{\max\{p_{01}^2,\, (1 - p_{11})^2\}\, |p_{11} - p_{01}|}{\big(1 - (p_{11} - p_{01})^2\big)(1 - p_{11} + p_{01})^2} \tag{35}
$$

$$
C_2(P) = \frac{2 \max\{p_{01},\, 1 - p_{11}\}\, |p_{11} - p_{01}|\, p_{01}}{(1 - p_{11} + p_{01})^3} + \frac{\max\{p_{01}^2,\, (1 - p_{11})^2\}\, |p_{11} - p_{01}|}{\big(1 - (p_{11} - p_{01})^2\big)(1 - p_{11} + p_{01})^2} \tag{36}
$$

Also, note that a similar but more tedious calculation can be done for $N = 3$.
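The long-run average rewards for $N = 2$ can be cross-checked numerically. The sketch below makes several assumptions that are not spelled out in this appendix: one channel is sensed per slot, a good channel yields reward 1, $\pi_1$ stays on a channel observed good and switches otherwise, and $\pi_2$ does the reverse. It iterates the joint chain of (sensed channel state, other channel state) to its stationary distribution and compares the result with closed forms $U_1 = \frac{p_{01}^2 + (1 - p_{01} + p_{11})\, p_{01}(1 - p_{11})}{(1 - p_{11} + p_{01})^2}$ and $U_2 = \frac{p_{01}^2}{(1 - p_{11} + p_{01})^2} + \frac{p_{01}(1 - p_{11})}{1 - p_{11} + p_{01}}$. All names are hypothetical.

```python
def stationary_reward(p01: float, p11: float, stay_when_good: bool,
                      iters: int = 2000) -> float:
    """Long-run probability that the sensed channel is good under a stay/switch rule."""
    p_good = {0: p01, 1: p11}  # probability of moving to state 1 from each state
    dist = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
    for _ in range(iters):
        new = {s: 0.0 for s in dist}
        for (x, y), mass in dist.items():
            stay = (x == 1) == stay_when_good
            # After the decision, the sensed channel evolves from src_sensed.
            src_sensed, src_other = (x, y) if stay else (y, x)
            for nx in (0, 1):
                px = p_good[src_sensed] if nx == 1 else 1 - p_good[src_sensed]
                for ny in (0, 1):
                    py = p_good[src_other] if ny == 1 else 1 - p_good[src_other]
                    new[(nx, ny)] += mass * px * py
        dist = new
    return dist[(1, 0)] + dist[(1, 1)]


def closed_form_u1(p01: float, p11: float) -> float:
    s = 1 - p11 + p01
    return (p01 ** 2 + (1 - p01 + p11) * p01 * (1 - p11)) / s ** 2


def closed_form_u2(p01: float, p11: float) -> float:
    s = 1 - p11 + p01
    return p01 ** 2 / s ** 2 + p01 * (1 - p11) / s
```

For the positively correlated simulation setting ($p_{01} = 0.3$, $p_{11} = 0.7$) both routes give $U_1 = 0.6$ and $U_2 = 0.4$, so the gap $U_1 - U_2$ that drives the regret bound is 0.2 under these assumptions.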